feat(failover): reactive failover walker #468

Open
quangdang46 wants to merge 3 commits into master from feat/A8-failover-walker

Summary

Implements the reactive failover walker for A8 — a per-session state machine that orchestrates automatic provider failover when errors occur mid-stream.

Architecture

Composes two existing pieces with one new orchestration layer:

  • classify_failover_error_message_structured (failover.rs) — classifies errors into FailoverDecision + ErrorCode
  • pick_next_fallback_route (fallback_pick.rs) — picks next best route with 3-tier ranking
  • New ReactiveFailoverWalker (failover_walker.rs) — adds the orchestration layer: per-session tracking, cooldowns, equivalence checks, internally-aborted bookkeeping
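
To make the layering concrete, here is a minimal sketch of how a classifier and a route picker could be composed by an orchestration step. Everything in it is a simplified stand-in: `Decision`, `classify_stub`, `pick_stub`, and `walk_once` are hypothetical names, and the real signatures of `classify_failover_error_message_structured` and `pick_next_fallback_route` are not shown in this description.

```rust
// Hypothetical stand-ins only. The real classifier and picker live in
// failover.rs and fallback_pick.rs and may look quite different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Decision {
    /// Stand-in variant: the error does not justify a failover.
    Stay,
    /// Named after the `RetryNextProvider` decision mentioned in the tests.
    RetryNextProvider,
}

/// Stand-in classifier: decides by case-insensitive substring match.
fn classify_stub(message: &str) -> Decision {
    let lower = message.to_lowercase();
    if lower.contains("rate limit") || lower.contains("context length") {
        Decision::RetryNextProvider
    } else {
        Decision::Stay
    }
}

/// Stand-in picker: the first candidate that differs from the current model.
/// (The real picker uses a 3-tier ranking, which is not modelled here.)
fn pick_stub<'a>(current: &str, candidates: &[&'a str]) -> Option<&'a str> {
    candidates.iter().copied().find(|c| *c != current)
}

/// Orchestration step: classify first, pick only if the decision asks for it.
fn walk_once<'a>(current: &str, error: &str, candidates: &[&'a str]) -> Option<&'a str> {
    match classify_stub(error) {
        Decision::RetryNextProvider => pick_stub(current, candidates),
        Decision::Stay => None,
    }
}
```

Keeping classification and route selection as separate functions lets the walker add session state around them without changing either one.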

New File

crates/jcode-provider-core/src/failover_walker.rs — 536 lines, 11 tests

Key types

  • WalkState — per-session state (original_model, current_model, fallback_index, failed_models with Instant cooldowns, attempt_count)
  • PreparedFallback — result of fallback preparation
  • WalkResult — full walk result (should_failover, new_model, decision, error_code, message)
  • ReactiveFailoverWalker — main struct
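
As a rough illustration of these types, the sketch below uses the field names listed above. The field types, the `new` constructor, and the `apply` helper are assumptions made for the example, and `decision` and `error_code` are plain strings standing in for the real `FailoverDecision` and `ErrorCode` types.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Per-session state. Field names follow the PR description; types are guesses.
#[derive(Debug, Clone)]
struct WalkState {
    original_model: String,
    current_model: String,
    fallback_index: usize,
    /// Model name -> instant at which its cooldown ends (assumed meaning).
    failed_models: HashMap<String, Instant>,
    attempt_count: u32,
}

impl WalkState {
    /// Hypothetical constructor: a fresh session starts on its original model.
    fn new(model: &str) -> Self {
        Self {
            original_model: model.to_string(),
            current_model: model.to_string(),
            fallback_index: 0,
            failed_models: HashMap::new(),
            attempt_count: 0,
        }
    }

    /// Hypothetical helper: count the attempt, and move to the new model
    /// only when the result asks for a failover and names a target.
    fn apply(&mut self, result: &WalkResult) {
        self.attempt_count += 1;
        if result.should_failover {
            if let Some(model) = &result.new_model {
                self.current_model = model.clone();
                self.fallback_index += 1;
            }
        }
    }
}

/// Result of one walk, with strings in place of the real enum types.
#[derive(Debug, Clone, PartialEq)]
struct WalkResult {
    should_failover: bool,
    new_model: Option<String>,
    decision: String,
    error_code: Option<String>,
    message: String,
}
```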

Main methods

  • walk_failover() — classify error → decide → pick fallback → update state → return result
  • prepare_fallback() — pick next candidate with max-attempts check
  • find_next_available_fallback() — walk chain respecting cooldowns + equivalence
  • record_failure/record_success — cooldown management
  • is_internally_aborted/mark_internally_aborted — for consumer integration
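
The interaction between cooldowns, equivalence checks, and the max-attempts guard can be sketched as follows. This is an illustration under stated assumptions, not the code in `failover_walker.rs`: the `Cooldowns` struct, the `equivalent_sketch` rule (case-insensitive, ignoring a `provider/` prefix), and the parameter list of `find_next_available` are all hypothetical. The current time is passed in explicitly so the behaviour is deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cooldown tracker: model name -> instant the cooldown ends.
#[derive(Default)]
struct Cooldowns {
    until: HashMap<String, Instant>,
}

impl Cooldowns {
    /// A failure puts the model on cooldown until `now + cooldown`.
    fn record_failure(&mut self, model: &str, now: Instant, cooldown: Duration) {
        self.until.insert(model.to_string(), now + cooldown);
    }

    /// A success clears any cooldown for the model.
    fn record_success(&mut self, model: &str) {
        self.until.remove(model);
    }

    fn is_cooling_down(&self, model: &str, now: Instant) -> bool {
        self.until.get(model).map_or(false, |end| now < *end)
    }
}

/// Assumed equivalence rule; the real `models_equivalent` may differ.
fn equivalent_sketch(a: &str, b: &str) -> bool {
    let strip = |s: &str| s.rsplit('/').next().unwrap_or(s).to_lowercase();
    strip(a) == strip(b)
}

/// Walk the chain in order, skipping models equivalent to the current one
/// and models still on cooldown. Returns None once max attempts is reached.
fn find_next_available<'a>(
    chain: &[&'a str],
    current: &str,
    cooldowns: &Cooldowns,
    now: Instant,
    attempts: u32,
    max_attempts: u32,
) -> Option<&'a str> {
    if attempts >= max_attempts {
        return None;
    }
    chain
        .iter()
        .copied()
        .find(|m| !equivalent_sketch(m, current) && !cooldowns.is_cooling_down(m, now))
}
```

Storing the cooldown end time rather than the failure time keeps the check to a single comparison and lets different failures carry different cooldown lengths.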

Test coverage

  1. Session lifecycle (register → get_state → unregister)
  2. Cooldown tracking (failure → cooldown → success clears)
  3. Fallback skips equivalent models
  4. Fallback skips cooldown models
  5. Max attempts reached
  6. Rate-limited failover (picks fallback)
  7. Context-length retries (RetryNextProvider, different model)
  8. Internally-aborted tracking
  9. No routes available
  10. models_equivalent helper
  11. Unregister cleans internally_aborted set
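
Items 1, 8, and 11 describe bookkeeping that is easy to picture with a small registry. The sketch below is hypothetical: `SessionRegistry` stores only a model name per session where the real walker keeps a full `WalkState`, and the reading of "internally aborted" as "the walker itself cancelled the stream" is an assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical registry; a stand-in for the walker's per-session storage.
#[derive(Default)]
struct SessionRegistry {
    /// Session id -> current model (stand-in for a full `WalkState`).
    sessions: HashMap<String, String>,
    /// Sessions flagged as aborted by the walker rather than by the user
    /// (assumed meaning of "internally aborted").
    internally_aborted: HashSet<String>,
}

impl SessionRegistry {
    fn register(&mut self, session: &str, model: &str) {
        self.sessions.insert(session.to_string(), model.to_string());
    }

    fn get_state(&self, session: &str) -> Option<&str> {
        self.sessions.get(session).map(String::as_str)
    }

    fn mark_internally_aborted(&mut self, session: &str) {
        self.internally_aborted.insert(session.to_string());
    }

    fn is_internally_aborted(&self, session: &str) -> bool {
        self.internally_aborted.contains(session)
    }

    /// Removing a session also drops its internally-aborted flag, so a later
    /// session that reuses the id does not inherit a stale flag.
    fn unregister(&mut self, session: &str) {
        self.sessions.remove(session);
        self.internally_aborted.remove(session);
    }
}
```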

Plan

docs/pr-plans/A8-failover-walker.md

Tracking

  • PARITY.md: entries 4.3 (Provider failover) and 9.8 (Reactive failover walker) updated
  • PR_BACKLOG.md: A8 marked ✅ Done

Add consolidated PR backlog from 13 reference repos (A-J, ~80 features)
and supporting docs (MASTER_GOAL_PROMPT, GOAL_DRIVEN_PROMPT, CONSOLIDATED_FINDINGS).
Adds ReactiveFailoverWalker, a per-session state machine that
detects provider errors in-flight, classifies them using the existing
error classifier, picks the next available fallback route, and tracks
cooldowns/retry counts. Composes three existing pieces:
- classify_failover_error_message_structured (error classifier)
- pick_next_fallback_route (route selector)
- FailoverDecision (action determiner)

Test coverage: 11 unit tests covering session lifecycle, cooldowns,
equivalence detection, max-attempts, and integration with the
existing error-classifier for rate-limit and context-length errors.

Refs: docs/pr-plans/A8-failover-walker.md